Machine Learning - Assignment 1

Author: Benjamin Brodwolf (benjamin.brodwolf@students.fhnw.ch)

Task 1

"Load and visualize the dataset house_data.csv."
options = delimitedTextImportOptions("Delimiter",{'"',','},"ConsecutiveDelimitersRule","join", "DataLines",2, "NumVariables",21, "ExtraColumnsRule","ignore", "LeadingDelimitersRule","ignore", "VariableNamesLine",1);
options = setvartype(options,2,"datetime");
options = setvartype(options,[1,3:8,13:21],'double');
options = setvartype(options,[11,12,15,16,17],"categorical");
options = setvartype(options,[9,10],"logical");
options = setvaropts(options,2,"InputFormat",'yyyyMMdd''T''HHmmss',"DatetimeFormat",'d MMM yyyy');
data = readtable("house_data.csv",options)
data = 21613×21 table
 iddatepricebedroomsbathroomssqft_livingsqft_lotfloorswaterfrontviewconditiongradesqft_abovesqft_basementyr_builtyr_renovatedzipcodelatlongsqft_living15sqft_lot15
17.1293e+0913 Oct 201422190031.0000118056501003711800195509817847.5112-122.257013405650
26.4141e+099 Dec 201453800032.250025707242200372170400195119919812547.7210-122.319016907639
35.6315e+0925 Feb 201518000021.000077010000100367700193309802847.7379-122.233027208062
42.4872e+099 Dec 201460400043.000019605000100571050910196509813647.5208-122.393013605000
51.9544e+0918 Feb 201551000032.0000168080801003816800198709807447.6168-122.045018007503
67.2376e+0912 May 2014122500044.5000542010193010031138901530200109805347.6561-122.00504760101930
71.3214e+0927 Jun 201425750032.2500171568192003717150199509800347.3097-122.327022386819
82.0080e+0915 Jan 201529185031.5000106097111003710600196309819847.4095-122.315016509711
92.4146e+0915 Apr 201522950031.000017807470100371050730196009814647.5123-122.337017808113
103.7935e+0912 Mar 201532300032.5000189065602003718900200309803847.3684-122.031023907570
111.7368e+093 Apr 201566250032.5000356097961003818601700196509800747.6007-122.145022108925
129.2129e+0927 May 201446800021.00001160600010047860300194209811547.6900-122.292013306000
1311410151628 May 201431000031.00001430199011.5000004714300192709802847.7558-122.2290178012697
146.0547e+097 Oct 201440000031.7500137096801.0000004713700197709807447.6127-122.0450137010208
151.1750e+0912 Mar 201553000052.0000181048501.5000003718100190009810747.6700-122.394013604850
169.2973e+0924 Jan 201565000043.0000295050002.000000391980970197909812647.5714-122.375021404000
171.8755e+0931 Jul 201439500032.00001890140402.0000003718900199409801947.7277-121.9620189014018
186.8652e+0929 May 201448500041.0000160043001.5000004716000191609810347.6648-122.343016104300
19160003975 Dec 201418900021.0000120098501.0000004712000192109800247.3089-122.210010605095
207.9832e+0924 Apr 201523000031.0000125097741.0000004712500196909800347.3343-122.306012808850
216.3005e+0914 May 201438500041.7500162049801.00000047860760194709813347.7025-122.341014004980
222.5240e+0926 Aug 2014200000032.75003050448671.000000392330720196809804047.5316-122.2330411020336
237.1380e+093 Jul 201428500052.5000227063002.0000003822700199509809247.3266-122.169022407005
248.0914e+0916 May 201425270021.5000107096431.0000003710700198509803047.3533-122.166012208386
253.8147e+0920 Nov 201432900032.2500245065002.0000004824500198509803047.3739-122.172022006865
261.2020e+093 Nov 201423300032.0000171046971.5000005617100194109800247.3048-122.218010304705
271.7945e+0926 Jun 201493700031.7500245026912.000000381750700191509811947.6386-122.360017603573
283.3037e+091 Dec 201466700031.0000140015811.5000005814000190909811247.6221-122.314018603861
295.1014e+0924 Jun 201443800031.7500152063801.00000037790730194809811547.6950-122.304015206235
301.8731e+092 Mar 201571900042.5000257071732.0000003825700200509805247.7073-122.110026306026
318.5628e+0910 Nov 201458050032.5000232039802.0000003823200200309802747.5391-122.070025803980
322.4260e+091 Dec 201428000021.5000119012653.0000003711900200509813347.7274-122.357013901756
3346100039024 Jun 201468750041.7500233050001.500000471510820192909811747.6823-122.368014605000
347.5892e+0910 Nov 201453500031.0000109030001.5000004810900192909811747.6889-122.375015705080
357.9551e+093 Dec 201432250042.7500206066591.000000371280780198109805847.4276-122.157020208720
369.5472e+0913 Jun 201469600032.5000230030601.500000381510790193020029811547.6827-122.310015903264
379.4353e+0928 May 201455000041.00001660348481.00000015930730193309805247.6621-122.1320216011467
382.7680e+0930 Dec 201464000042.0000236060002.0000004823600190409810747.6702-122.362017304700
397.8955e+0913 Feb 201524000041.0000122080751.00000027890330196909800147.3341-122.282012907800
402.0785e+0920 Jun 201460500042.5000262075532.0000003826200199609805647.5301-122.1800262011884
415.5477e+0915 Jul 201462500042.5000257055202.0000003925700200009807447.6145-122.027024705669
427.7662e+0911 Aug 201477500042.25004220241861.0000003826001620198409816647.4450-122.3470241030617
437.2032e+097 Jul 201486199052.7500359556392.0000003935950201409805347.6848-122.016036255639
449.2702e+0928 Oct 201468500031.0000157022802.0000003715700192209811947.6413-122.364015802640
451.4327e+0929 Jul 201430900031.0000128096561.00000046920360195909805847.4485-122.175013408808
468.0354e+0918 Jul 201448800032.50003160136032.0000003831600200309801947.7443-121.977030509232
478.9452e+0925 Mar 201521049031.000099085281.000000369900196609802347.3066-122.371012288840
484.1783e+0916 Jul 201478500042.50002290134162.0000004922900198109800747.6194-122.1510268013685
499.2154e+0928 Apr 201545000031.7500125059631.0000004712500195309811547.6796-122.30109705100
5082203908411 Mar 2015135000032.50002753650051.000000592165588195309807047.4041-122.4510268072513
515.2456e+0916 Sep 201422800031.0000119091991.0000003711900195509814847.4258-122.322011909364
527.2313e+0917 Feb 201534500052.5000315091341.0000004816401510196609805647.4934-122.189019909133
537.5185e+0931 Dec 201460000031.7500141040801.000000471000410195009811747.6808-122.384014104080
543.6260e+095 Feb 201558500021.7500198085501.00000037990990198109811747.6989-122.369014806738
554.2174e+093 Mar 201592000052.2500273060001.500000382130600192709810547.6571-122.281027306000
569.8227e+0912 May 201488500042.5000283050002.0000003928300199509810547.6597-122.290019505000
579.4785e+0919 Aug 201429250042.5000225044952.0000003722500200809804247.3663-122.114022504500
582.7998e+097 Apr 201530100032.5000242047502.0000003824200200309804247.3663-122.122026904750
597.9228e+0927 Aug 201495100053.25003250143422.0000004832500196809800847.5880-122.1160296011044
608.0790e+0923 Feb 201543000043.0000185099762.0000003818500199109805947.5059-122.149022708542
611.5160e+0910 Dec 201465000032.25002150212351.000000481590560195909816647.4336-122.3390257018900
629.5582e+0928 Aug 201428900031.7500126084001.0000003712600195409814847.4366-122.335012908750
635.0724e+0921 Oct 201450500031.7500251986902.0000005825190197309816647.4428-122.344025009500
649.5281e+097 Dec 201454900031.7500154010443.0000003815400201409811547.6765-122.320015803090
651.1890e+093 Jun 201442500032.2500166060001.000000371110550197909812247.6113-122.297014404080
663.2535e+0920 Nov 201431762532.7500277038091.5000005717701000192509814447.5747-122.304014404000
673.3941e+099 Sep 201497500042.50002720110492.00000031027200198909800447.5815-122.1920275011049
683.7170e+099 Oct 201428700042.5000224046482.0000003722400200509800147.3378-122.257022214557
691.2745e+0925 Aug 201420400031.00001000120701.0000004710000196809804247.3621-122.1100101012635
701.8020e+0912 Jun 2014132500052.25003200201581.0000003816001600196509800447.6303-122.2150339020158
711.5251e+0912 Sep 2014104000053.25004770500941.00000041130701700197309800547.6525-122.1600353038917
721.0490e+095 Jan 201532500032.0000126056121.0000004712600197209803447.7362-122.179016404745
738.8209e+0910 Jun 201457100042.0000275078071.500000572250500191609812547.7168-122.287015107807
745.4165e+0910 Jul 201436000042.5000238050002.0000003823800200509803847.3608-122.036024205000
753.4441e+0916 Mar 201534900031.75001790505291.000000571090700196509804247.3511-122.0730194050529
763.2769e+095 Nov 201483250044.00003430351022.00000041023901040198609807547.5822-121.9870324035020
774.0368e+0913 Oct 201438000041.7500176073001.00000037880880195609800847.6034-122.125016807500
782.3916e+0920 Apr 201548000031.0000104050601.0000003710400194109811647.5636-122.39408905060
796.3000e+099 Jun 201441000031.0000141050601.00000047910500195609813347.7073-122.340011305693
801.5310e+0923 Mar 201572000042.50003450396832.00000031034500200209801047.3420-122.0250335039750
815.1045e+092 Dec 201439000032.5000235051002.0000003823500200309803847.3512-122.008023505363
827.4371e+0922 Dec 201436000042.5000190058892.0000003719000199209803847.3490-122.031018706405
839.4184e+0928 Oct 201435500021.0000202067201.0000003710101010194809811847.5474-122.291017206720
841.5231e+0928 Jan 201535600031.5000168087121.0000003816800196409805947.4811-122.149018508797
851.1330e+092 Jun 201431500031.000096066341.000000369600195209812547.7264-122.310015707203
864.2329e+0914 Nov 201494000031.5000214036002.000000391900240192509811947.6337-122.365020204800
872.5990e+093 Nov 201430500052.2500266084001.5000005726600196109809247.2909-122.189015908165
883.3421e+0918 Jun 201446100033.2500277062782.000000391980790200609805647.5228-122.199019007349
891.3327e+0919 May 201421500022.2500161020402.0000004716100197909805647.5180-122.194019502025
903.8699e+094 Sep 201433500021.7500103010662.00000037765265200609813647.5394-122.387010301106
912.7915e+0922 May 201424350042.5000198074032.0000003719800198809802347.2897-122.372019807510
925.0363e+0911 Mar 2015109988052.7500352063532.00000041035200200109819947.6506-122.391025206250
934.1680e+0926 Feb 201515300031.00001200105001.0000003712000196209802347.3220-122.3510135010500
946.0215e+0925 Jul 201443000031.5000158050001.000000381290290193909811747.6870-122.386015704500
956.0215e+0923 Dec 201470000031.5000158050001.000000381290290193909811747.6870-122.386015704500
961.4833e+098 Sep 201490500042.50003300102501.000000372390910194619919804047.5873-122.249019506045
973.4220e+0930 Mar 201524750031.75001960156811.0000003719600196709803247.3576-122.2770175015616
981.0996e+0912 Sep 201419900041.5000116064001.0000004711600197509802347.3036-122.378011606400
9972207910411 Jul 201431400031.75001810418001.000000571210600198009803847.4109-121.95801650135036
1007.3382e+0916 May 201443750032.50002320368472.0000003923200199209804547.4838-121.7140255035065

Task 2

"Estimate ‘price‘ using 'sqft_living‘ as the only variable. Visualize the regression line together with the datapoints. Generate a Tukey-Anscombe-Plot and interpret what it shows."
% loading the prices and sqft_living
prices = table2array( data(:,"price") );
sqft_living = table2array( data(:, "sqft_living"));
figure;
% Plot the data sqft_living and prices
plot(sqft_living, prices, 'o', 'MarkerSize', 1);
title({'Task2','Plot Price & Square feet of Living'});
ylabel('House Prices on 10,000s');
xlabel('Living Areas Sq. Ft.');
m = length(prices);
% Add intercept term to sqliving
sqft_living = [ones(m, 1) sqft_living];
% Estimate the prices with sqft_living
theta = normalEqn(sqft_living, prices);
fprintf('Theta computed from the normal equations:\n%f,\n%f',theta(1),theta(2))
Theta computed from the normal equations: -43580.743095, 280.623568
% Plot the linear fit
hold on; % keep previous plot visible
plot(sqft_living(:,2), sqft_living*theta, 'r-');
legend('Livin area data', 'Linear regression');
hold off;
prices_pred = sqft_living*theta;
residual = prices - prices_pred;
tukeyAnscombePlot(prices_pred, residual, "2");
Above and under the "Zero Line" Top: 9724 Bottom: 11889 Std.dev.: 1530.886181
We see different error varieties:
It can possibly be fixed with the help of a log transformation.

Task 3

"Apply a log-transformation to get the output variable log(‘price‘) and repeat task 2. Additionally, create a histogram of the residuals (number of bins = 100).
Calculate the standard deviation of the residuals and interpret the created plots and values."
% clean loading of the data
prices = table2array( data(:,"price") );
sqft_living = table2array( data(:, "sqft_living"));
prices_log = log(prices);
m = length(prices);
% Add intercept term to sqliving
sqft_living = [ones(m, 1) sqft_living];
theta_log = normalEqn(sqft_living, prices_log);
prices_log_pred = sqft_living*theta_log;

Plot log(prices) with Square Feets of living with Lineare Regression

figure;
hold on; % for new figure
plot(sqft_living(:,2), prices_log, 'o', 'MarkerSize', 1); % Plot the data
title({'Task 3',' Plot log(Prices) & Square feets of living'});
ylabel('Log(Prices)');
xlabel('Living Areas Sq. Ft. 10,000s');
plot(sqft_living(:,2), sqft_living*theta_log, 'r-')
legend('Livin area data', 'Linear regression')

Plot Tukey Anscomble Plot

residual_log = prices_log - prices_log_pred;
tukeyAnscombePlot(prices_log_pred, residual_log, "3");
Above and under the "Zero Line" Top: 11114 Bottom: 10499 Std.dev.: 434.870670
The log transformation has fixed the cone shape and the distribution of the data is much better. We can also count the data between the "Zero Line" and see it is more balanced then before.
But the points have still a top high align to the left side.

Create a histogram of the residuals (number of bins = 100).

figure;
nbins = 100;
histogram(residual_log, nbins);
title({'Task 3',' Histogram: log(Residuan) & Number of Bins (100)'})
ylabel('log(Residuen)'); % Set the y-axis label
xlabel('number of bins'); % Set the x-axis label
hold on;
The histogram shows a skew to the left. It also has smaller outliers to see on the tops.
fprintf("The Standard deviation of the residuals with log('price') is: " + std(residual_log));
The Standard deviation of the residuals with log('price') is: 0.37852
The calculation of the standard deviation from the price with logarithmic is not very bad - but according to the view of the histogram, there could be adjustments and thus make the standard deviation even better or smaller.

Task 4

"Calculate the ‘mean absolute percentage error’ (MAPE) and the ‘median absolute percentage error’ (MdAPE) for the model of task 3. Visualize the distribution of the absolute percentage errors for the models of tasks 2 and 3."
% "Absolute Percentage Error (APE)" of Task 3.
prices_log = log(prices);
LOG_APE = abs( (prices_log-prices_log_pred)./prices_log );
LOG_MAPE = mean( LOG_APE );
fprintf('The MAPE of Task 3 is %f (%4.2f%%) ' , LOG_MAPE, LOG_MAPE*100); % MAPE is 0.023597 -> 2.36%
The MAPE of Task 3 is 0.023597 (2.36%)
LOG_MdAPE = median( LOG_APE );
fprintf('The MdAPE of Task 3 is %f (%4.2f%%) ' , LOG_MdAPE, LOG_MdAPE*100); % MdAPE is 0.020879 -> 2.09%
The MdAPE of Task 3 is 0.020879 (2.09%)
% "Absolute Percentage Error (APE)" of Task 2.
APE = abs( (prices-prices_pred)./prices );
figure;
hold on;
grid on;
ksdensity(APE);
ksdensity(LOG_APE);
title({'Task 4',' Kernel Distribution of the APE of Task 2 & 3'});
legend('APE with price (Task2)', 'APE with log(price) (Task3)');
ylabel('Density'); % Set the y-axis label
xlabel(' APE '); % Set the x-axis label
hold off;
You can clearly see that the logarithmic values from Task 3 are much denser than those without from Task 2. So we are going much better with the logarithmic values.

Task 5

"Investigate the spatial distribution of ‘zipcode‘ and ‘price‘ with a scatter plot by marking the geographic location (‘long‘ / ‘lat‘) of each object with a colored dot representing either ‘zipcode‘ or ‘price‘ in a suitable color-table and -scale. Discuss the relationship between the two variables."
% load data
prices = table2array( data(:,"price") );
zipcode = double(string(table2array( data(:,"zipcode") )));
long = table2array( data(:,"long") );
lat = table2array( data(:,"lat") );

Scatter plot with 'zipcode':

amount_of_zipcode = num2str(length(unique(zipcode)))
amount_of_zipcode = '70'
% scale zipcode between 1 to 0 for coloring
min_zipcode = min(zipcode);
max_zipcode = max(zipcode);
color = (zipcode - min_zipcode) ./ (max_zipcode-min_zipcode);
figure;
geoscatter(lat, long, zipcode./50000 ,color, 'filled');
cbar = colorbar('Ticks', [0, 1], ...
'TickLabels', { num2str(min(zipcode), '%d') , num2str(max(zipcode), '%d')}, ...
'TickLabelInterpreter', "tex");
ylabel(cbar, ' Zipcode');
title(['Task 5: GeoScatter-Plot of ', amount_of_zipcode ,' Zipcodes']);
The zip codes show how states and cities in the USA have been divided in zones. Here you can see how the 70 zones from Washington to Seattle have been divided. From 98001 (dark blue) to 98199 (yellow) direction to Seattel zipcode numbers was increased.
Conclusion: The zip codes with the highest numbers (range from green to yellow) are closest to the city of Seattle.

Scatter plot with 'price':

% scale prices between 1 to 0 for coloring
min_price = min(prices);
max_price = max(prices);
color = (prices - min_price) ./ (max_price-min_price);
figure;
geoscatter(lat, long, prices./200000 ,color, 'filled');
cbar = colorbar('Ticks', [0, 0.25 0.5, 0.75 1], ...
'TickLabels', { num2str(min(prices), '%d'), num2str(max(prices)*0.25, '%d') , num2str(max(prices)*0.5, '%d') , num2str(max(prices)*0.75, '%d') , num2str(max(prices), '%d')}, ...
'TickLabelInterpreter', "tex");
ylabel(cbar, 'Prices');
title({'Task 5',' GeoScatter-Plot of House-Prices'});
The prices of the houses are highest in the center of the city of Seattle and are on the lake. The lighter blue spots are clearly visible. However, you do not see any yellow dots because they are rather rare and lose themselves among the many blue dots.

Scatter plot with 'price' zoomed:

figure;
geoscatter(lat, long, prices./200000 ,color, 'filled');
cbar = colorbar('Ticks', [0, 0.25 0.5, 0.75 1], ...
'TickLabels', { num2str(min(prices), '%d'), num2str(max(prices)*0.25, '%d') , num2str(max(prices)*0.5, '%d') , num2str(max(prices)*0.75, '%d') , num2str(max(prices), '%d')}, ...
'TickLabelInterpreter', "tex");
ylabel(cbar, 'Prices');
title({'Task 5','GeoScatter-Plot of House-Prices','(Zoomed closer to the expensive areas - Center of Seattle)'});
geolimits([47.551 47.726],[-122.340 -122.054])
Again the same plot as before, just zoomed into the center of Seattle. Now you can see the individual very expensive houses or the yellow dots. You can also see that the houses closest to the water are much more expensive than the others.

Task 6

"Feature Engineering: Find a way to encode the variable ‘zipcode‘ into additional features (e.g. using one-hot-encoding, see example on the below). Estimate log( price ) using the newly created features together with 'sqft_living‘ in a linear regression.
Regenerate a Tukey-Anscombe-Plot and a histogram of the residuals and calculate the values for MAPE and MdAPE and discuss the plots and the values in relation to the previous solutions."
% load data
prices = table2array( data(:,"price") );
zipcode = double(string(table2array( data(:,"zipcode") )));
sqft_living = table2array( data(:, "sqft_living"));
% log the prices
prices_log = log(prices);
% create a unique list of zipcodes
uniq_zipcode = unique(zipcode);
% create the "one hot encoding" of zipcodes
[~, loc] = ismember(zipcode, uniq_zipcode);
zipcode_hot_encoding = transpose(full(ind2vec(loc')));
% add and compine [bias | sqft_living ] to the zipcode_hot_encoding
zipcode_hot_encoding = [ones(length(sqft_living), 1), sqft_living, zipcode_hot_encoding];
% create a string-list-zipcode for the table
zipcode_header = strsplit( num2str( uniq_zipcode'));
% creating the zipcode_hot_ecoding_table
zipcode_hot_encoding_table = array2table(zipcode_hot_encoding, 'VariableNames', ["bias", "sqft_living", zipcode_header])
zipcode_hot_encoding_table = 21613×72 table
 biassqft_living98001980029800398004980059800698007980089801098011980149801998022980239802498027980289802998030980319803298033980349803898039980409804298045980529805398055980569805898059980659807098072980749807598077980929810298103981059810698107981089810998112981159811698117981189811998122981259812698133981369814498146981489815598166981689817798178981889819898199
1111800000000000000000000000000000000000000000000000000000000000000000001000
2125700000000000000000000000000000000000000000000000000000000100000000000000
317700000000000000000100000000000000000000000000000000000000000000000000000
4119600000000000000000000000000000000000000000000000000000000000100000000000
5116800000000000000000000000000000000000000100000000000000000000000000000000
6154200000000000000000000000000000010000000000000000000000000000000000000000
7117150010000000000000000000000000000000000000000000000000000000000000000000
8110600000000000000000000000000000000000000000000000000000000000000000000010
9117800000000000000000000000000000000000000000000000000000000000001000000000
10118900000000000000000000000010000000000000000000000000000000000000000000000
11135600000001000000000000000000000000000000000000000000000000000000000000000
12111600000000000000000000000000000000000000000000000000100000000000000000000
13114300000000000000000100000000000000000000000000000000000000000000000000000
14113700000000000000000000000000000000000000100000000000000000000000000000000
15118100000000000000000000000000000000000000000000001000000000000000000000000
16129500000000000000000000000000000000000000000000000000000000010000000000000
17118900000000000010000000000000000000000000000000000000000000000000000000000
18116000000000000000000000000000000000000000000001000000000000000000000000000
19112000100000000000000000000000000000000000000000000000000000000000000000000
20112500010000000000000000000000000000000000000000000000000000000000000000000
21116200000000000000000000000000000000000000000000000000000000001000000000000
22130500000000000000000000000000100000000000000000000000000000000000000000000
23122700000000000000000000000000000000000000000100000000000000000000000000000
24110700000000000000000001000000000000000000000000000000000000000000000000000
25124500000000000000000001000000000000000000000000000000000000000000000000000
26117100100000000000000000000000000000000000000000000000000000000000000000000
27124500000000000000000000000000000000000000000000000000000010000000000000000
28114000000000000000000000000000000000000000000000000001000000000000000000000
29115200000000000000000000000000000000000000000000000000100000000000000000000
30125700000000000000000000000000000100000000000000000000000000000000000000000
31123200000000000000001000000000000000000000000000000000000000000000000000000
32111900000000000000000000000000000000000000000000000000000000001000000000000
33123300000000000000000000000000000000000000000000000000001000000000000000000
34110900000000000000000000000000000000000000000000000000001000000000000000000
35120600000000000000000000000000000000010000000000000000000000000000000000000
36123000000000000000000000000000000000000000000000000000100000000000000000000
37116600000000000000000000000000000100000000000000000000000000000000000000000
38123600000000000000000000000000000000000000000000001000000000000000000000000
39112201000000000000000000000000000000000000000000000000000000000000000000000
40126200000000000000000000000000000000100000000000000000000000000000000000000
41125700000000000000000000000000000000000000100000000000000000000000000000000
42142200000000000000000000000000000000000000000000000000000000000000001000000
43135950000000000000000000000000000010000000000000000000000000000000000000000
44115700000000000000000000000000000000000000000000000000000010000000000000000
45112800000000000000000000000000000000010000000000000000000000000000000000000
46131600000000000010000000000000000000000000000000000000000000000000000000000
4719900000000000000100000000000000000000000000000000000000000000000000000000
48122900000001000000000000000000000000000000000000000000000000000000000000000
49112500000000000000000000000000000000000000000000000000100000000000000000000
50127530000000000000000000000000000000000010000000000000000000000000000000000
51111900000000000000000000000000000000000000000000000000000000000000100000000
52131500000000000000000000000000000000100000000000000000000000000000000000000
53114100000000000000000000000000000000000000000000000000001000000000000000000
54119800000000000000000000000000000000000000000000000000001000000000000000000
55127300000000000000000000000000000000000000000000100000000000000000000000000
56128300000000000000000000000000000000000000000000100000000000000000000000000
57122500000000000000000000000000010000000000000000000000000000000000000000000
58124200000000000000000000000000010000000000000000000000000000000000000000000
59132500000000100000000000000000000000000000000000000000000000000000000000000
60118500000000000000000000000000000000001000000000000000000000000000000000000
61121500000000000000000000000000000000000000000000000000000000000000001000000
62112600000000000000000000000000000000000000000000000000000000000000100000000
63125190000000000000000000000000000000000000000000000000000000000000001000000
64115400000000000000000000000000000000000000000000000000100000000000000000000
65116600000000000000000000000000000000000000000000000000000001000000000000000
66127700000000000000000000000000000000000000000000000000000000000010000000000
67127200001000000000000000000000000000000000000000000000000000000000000000000
68122401000000000000000000000000000000000000000000000000000000000000000000000
69110000000000000000000000000000010000000000000000000000000000000000000000000
70132000001000000000000000000000000000000000000000000000000000000000000000000
71147700000100000000000000000000000000000000000000000000000000000000000000000
72112600000000000000000000000100000000000000000000000000000000000000000000000
73127500000000000000000000000000000000000000000000000000000000100000000000000
74123800000000000000000000000010000000000000000000000000000000000000000000000
75117900000000000000000000000000010000000000000000000000000000000000000000000
76134300000000000000000000000000000000000000010000000000000000000000000000000
77117600000000100000000000000000000000000000000000000000000000000000000000000
78110400000000000000000000000000000000000000000000000000010000000000000000000
79114100000000000000000000000000000000000000000000000000000000001000000000000
80134500000000010000000000000000000000000000000000000000000000000000000000000
81123500000000000000000000000010000000000000000000000000000000000000000000000
82119000000000000000000000000010000000000000000000000000000000000000000000000
83120200000000000000000000000000000000000000000000000000000100000000000000000
84116800000000000000000000000000000000001000000000000000000000000000000000000
8519600000000000000000000000000000000000000000000000000000000100000000000000
86121400000000000000000000000000000000000000000000000000000010000000000000000
87126600000000000000000000000000000000000000000100000000000000000000000000000
88127700000000000000000000000000000000100000000000000000000000000000000000000
89116100000000000000000000000000000000100000000000000000000000000000000000000
90110300000000000000000000000000000000000000000000000000000000000100000000000
91119800000000000000100000000000000000000000000000000000000000000000000000000
92135200000000000000000000000000000000000000000000000000000000000000000000001
93112000000000000000100000000000000000000000000000000000000000000000000000000
94115800000000000000000000000000000000000000000000000000001000000000000000000
95115800000000000000000000000000000000000000000000000000001000000000000000000
96133000000000000000000000000000100000000000000000000000000000000000000000000
97119600000000000000000000010000000000000000000000000000000000000000000000000
98111600000000000000100000000000000000000000000000000000000000000000000000000
99118100000000000000000000000010000000000000000000000000000000000000000000000
100123200000000000000000000000000001000000000000000000000000000000000000000000
% estimate the log(price) with the newly created features and sqft_livin
theta_zipcode_price = normalEqn(zipcode_hot_encoding, prices_log);
% predicted prices
prices_log_pred = zipcode_hot_encoding*theta_zipcode_price;
% calculate the residual
residual_log = prices_log - prices_log_pred;

Tukey Anscomble Plot of residuals

tukeyAnscombePlot(prices_log_pred, residual_log, "6");
Above and under the "Zero Line" Top: 10989 Bottom: 10624 Std.dev.: 258.093975
The scape of the distribution looks definitely better, which is positive good, but the balance of the distribution has changed little to be better.

Histogram of the residuals

figure;
nbins = 100;
histogram(residual_log, nbins);
title({'Task 6',' Histogram: log(Residuan) & Number of Bins ( 100 )'});
ylabel('log(Residuen)');
xlabel('number of bins');
The histogram also looks great like the Tukey-Anscombe plot, it has also got a cleaner shape. But the histogram is still skewed on the left.
fprintf("The Standard deviation of the residuals is: " + std(residual_log));
The Standard deviation of the residuals is: 0.22166
The standard deviation is better then before (see Task 3).

Calculate the values for MAPE and MdAPE

% "Absolute Percentage Error (APE)" of Task 6.
prices_log = log(prices);
LOG_APE = abs( (prices_log-prices_log_pred)./prices_log );
LOG_MAPE = mean( LOG_APE );
fprintf('MAPE of Task 6 is %f (%4.2f%%) ' , LOG_MAPE, LOG_MAPE*100); % MAPE is 0.012250 -> 1.22%
MAPE of Task 6 is 0.012250 (1.22%)
LOG_MdAPE = median( LOG_APE );
fprintf('MdAPE of Task 6 is %f (%4.2f%%) ' , LOG_MdAPE, LOG_MdAPE*100); % MdAPE is 0.008987 -> 0.90%
MdAPE of Task 6 is 0.008987 (0.90%)
The mean and median relative distance of the true measured values is accordingly 1.22% (MAPE) and 0.90% (MdAPE).

Task 7

Repeat all steps of task 6) using [‘bedrooms‘, ‘bathrooms‘, ‘grade‘, ‘yr_built'] as additional features. Discuss the outcome differences between tasks 6) and 7).
% load data
prices = table2array( data(:,"price") );
sqft_living = table2array( data(:, "sqft_living"));
zipcode = double(string(table2array( data(:,"zipcode") )));
bedrooms = double(string(table2array( data(:,"bedrooms") )));
bathrooms = double(string(table2array( data(:,"bathrooms") )));
grade = double(string(table2array( data(:,"grade") )));
yr_built = double(string(table2array( data(:,"yr_built") )));
% log the prices
prices_log = log(prices);
% create the "one hot encoding" of the features
zipcode_ohe = OneHotEncoding(zipcode);
bedrooms_ohe = OneHotEncoding(bedrooms);
bathrooms_ohe = OneHotEncoding(bathrooms);
grade_ohe = OneHotEncoding(grade);
yr_built_ohe = OneHotEncoding(yr_built);
% add and compine [bias | sqft_living ] to the zipcode_hot_encoding
many_features_ohe = [ones(length(prices), 1), sqft_living, zipcode_ohe, bedrooms_ohe, bathrooms_ohe, grade_ohe, yr_built_ohe]
many_features_ohe = 21613×243
1 1180 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2570 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 770 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1960 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1680 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 5420 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1715 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1060 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1780 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1890 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
% estimate the log(price) with the newly created features and sqft_livin
theta_many_features_price = normalEqn(many_features_ohe, prices_log);
% predicted prices
prices_many_features_pred = many_features_ohe*theta_many_features_price;
% calculate the residual
residual_many_features_log = prices_log - prices_many_features_pred;

Tukey Anscomble Plot of residuals

tukeyAnscombePlot(prices_many_features_pred, residual_many_features_log, "7");
Above and under the "Zero Line" Top: 10786 Bottom: 10827 Std.dev.: 28.991378
The scattering of the data points is now even broader and still very even. The balance looks great, as the distribution between the zero line now has a much lower value.

Histogram of the residuals

figure;
nbins = 100;
histogram(residual_many_features_log, nbins);
title({'Task 7','Histogram: log(Residuan) & Number of Bins ( 100 )'});
ylabel('log(Residuen)');
xlabel('number of bins');
hold on;
The histogram is very well balanced. No left or right skewed anymore.
fprintf("The Standard deviation of the residuals is: " + std(residual_many_features_log));
The Standard deviation of the residuals is: 0.19925
The standard deviation is better then before (see Task 3).

Calculate the values for MAPE and MdAPE

% "Absolute Percentage Error (APE)" of Task 6.
LOG_APE = abs( (prices_log-prices_many_features_pred)./prices_log );
LOG_MAPE = mean( LOG_APE );
fprintf('MAPE of Task 7 is %f (%4.2f%%) ' , LOG_MAPE, LOG_MAPE*100); % MAPE is 0.011048 -> 1.12%
MAPE of Task 7 is 0.011048 (1.10%)
LOG_MdAPE = median( LOG_APE );
fprintf('MdAPE of Task 7 is %f (%4.2f%%) ' , LOG_MdAPE, LOG_MdAPE*100); % MdAPE is 0.008143 -> 0.81%
MdAPE of Task 7 is 0.008143 (0.81%)
The mean and median relative distance of the true measured values is accordingly 1.12% (MAPE) and 0.81% (MdAPE) and a little lower then before (see Task 6).

Task 8

"Recreate the model of task 7) with the normal equation for linear regression and regularize it for a suitable range of the regularization parameter ...."
% load data
prices = table2array( data(:,"price") );
sqft_living = table2array( data(:, "sqft_living"));
zipcode = double(string(table2array( data(:,"zipcode") )));
bedrooms = double(string(table2array( data(:,"bedrooms") )));
bathrooms = double(string(table2array( data(:,"bathrooms") )));
grade = double(string(table2array( data(:,"grade") )));
yr_built = double(string(table2array( data(:,"yr_built") )));
% create the "one hot encoding" of the features
zipcode_ohe = OneHotEncoding(zipcode);
bedrooms_ohe = OneHotEncoding(bedrooms);
bathrooms_ohe = OneHotEncoding(bathrooms);
grade_ohe = OneHotEncoding(grade);
yr_built_ohe = OneHotEncoding(yr_built);
% add and compine one-hots-encodings
sqft_living = (sqft_living - min(sqft_living)) ./ (max(sqft_living)- min(sqft_living));
many_features_ohe = [ones(length(prices), 1) sqft_living zipcode_ohe bedrooms_ohe bathrooms_ohe grade_ohe yr_built_ohe]
many_features_ohe = 21613×243
1.0000 0.0672 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000 0.1721 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000 0.0362 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000 0.1260 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000 0.1049 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000 0 0 0 0 0 0 0 0 0 0 1.0000 0.3872 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000 0.1075 0 0 1.0000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000 0.0581 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000 0.1125 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000 0.1208 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
% shuffel rows
many_features_ohe = many_features_ohe(randperm(size(many_features_ohe,1)),:);
% create random a train (80%) and test (20%) set
size_train = round(length(many_features_ohe)*0.8);
size_test = round(length(many_features_ohe)*0.2);
many_features_train = many_features_ohe(1:size_train,:);
many_features_test = many_features_ohe(size_train+1:end,:);
% log the prices
prices_log = log(prices);
% the true y
prices_log_train = prices_log(1:size_train,:);
prices_log_test = prices_log(size_train+1:end, :);

Selecting lambda using a test set

lambda_vec = [0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.01 0.2 0.04 0.5 0.6 0.7 0.8 0.9 1 2 3 4 5 6 7 8 9 10 15 20 30 50 70 90 100 150];
m = length(lambda_vec);
thetas_sum = zeros(size(lambda_vec,2),1);
prices_test_pred = zeros(size(many_features_test,1), m);
for i=1:m
lambda = lambda_vec(i);
theta = normalEqnLambda(many_features_train, prices_log_train, lambda);
thetas_sum(i) = sum(theta(2:end) .^ 2); %sum(theta.^2)
prices_test_pred(:,i) = many_features_test*theta;
end
APE_test = abs( (prices_log_test-prices_test_pred)./prices_log_test );
MAPE_test = mean( APE_test );
MdAPE_test = median( APE_test );
for i = 1:length(lambda_vec)
if i == 1
fprintf('lambda\t\t MAPE test\t MdAPE test\t Theta sum \n');
end
fprintf('%f\t%f\t%f\t%f\n',lambda_vec(i), MAPE_test(i), MdAPE_test(i), thetas_sum(i));
end
lambda MAPE test MdAPE test Theta sum
0.000000 0.030469 0.025428 27.096320 0.001000 0.030454 0.025428 1.859713 0.002000 0.030454 0.025428 1.857639 0.003000 0.030454 0.025428 1.855571 0.004000 0.030454 0.025428 1.853508 0.005000 0.030454 0.025428 1.851450 0.006000 0.030454 0.025428 1.849398 0.007000 0.030454 0.025428 1.847351 0.008000 0.030454 0.025428 1.845309 0.009000 0.030454 0.025428 1.843273 0.010000 0.030454 0.025428 1.841242 0.200000 0.030452 0.025441 1.533232 0.040000 0.030454 0.025427 1.782708 0.500000 0.030450 0.025438 1.239558 0.600000 0.030449 0.025436 1.171214 0.700000 0.030448 0.025425 1.112507 0.800000 0.030448 0.025423 1.061544 0.900000 0.030447 0.025429 1.016891 1.000000 0.030446 0.025436 0.977444 2.000000 0.030440 0.025447 0.742061 3.000000 0.030436 0.025449 0.630961 4.000000 0.030431 0.025427 0.564219 5.000000 0.030428 0.025400 0.518377 6.000000 0.030424 0.025439 0.484131 7.000000 0.030421 0.025444 0.457058 8.000000 0.030418 0.025438 0.434783 9.000000 0.030415 0.025447 0.415910 10.000000 0.030412 0.025419 0.399559 15.000000 0.030401 0.025412 0.340292 20.000000 0.030392 0.025431 0.300877 30.000000 0.030377 0.025276 0.248189 50.000000 0.030357 0.025207 0.186649 70.000000 0.030341 0.025169 0.149750 90.000000 0.030329 0.025170 0.124541 100.000000 0.030324 0.025150 0.114648 150.000000 0.030306 0.025190 0.080703
[minMAPE, iMAPE] = min(MAPE_test);
fprintf('Lambda value %f give the smallest MAPE value %f', lambda_vec(iMAPE), minMAPE);
Lambda value 150.000000 give the smallest MAPE value 0.030306

I don't know the issue here, why the biggest lambda give the best MAPE ?!? :-/

% calculate the test residual
residual_test = prices_log_test - prices_test_pred;

Tukey Anscomble Plot of the test residuals

tukeyAnscombePlot(prices_test_pred, residual_test, "8");
Above and under the "Zero Line" Top: 2254 Bottom: 2069 Std.dev.: 130.814755

Histogram of the residuals

figure;
nbins = 100;
histogram(residual_test, nbins);
title({'Task 8','Histogram: log(Residuan) & Number of Bins ( 100 )'});
ylabel('log(Residuen)');
xlabel('number of bins');
The histogram is almost very well balanced in the random iterations. No much left or right skewed anymore.
fprintf("The Standard deviation of the residuals is: " + std(residual_many_features_log));
The Standard deviation of the residuals is: 0.19925
The standard deviation is better then before (see Task 3).

Task 9

"What would be needed if you wanted to optimize the MAPE as cost function instead of the ‘residual-sum-of-squares’?
Try to derive the corresponding formulas and give a sketch of the resulting algorithm e.g. in pseudo-code."
So I have the APE:
The I take the Cost-Functione: ,
replace it with MAPE = mean (APE